# Multimodal Processing

Bart Large Empathetic Dialogues
A model built on the Transformers library; its specific purpose and functionality are not documented.
Large Language Model Transformers
sourname
199
1
Openclip ViT H 14 FARE2
MIT
A robust image encoder model based on the Transformers library, focused on image feature extraction tasks
Large Language Model Transformers
RCLIP
24
0
Mixtex Finetune
MIT
MixTex base_ZhEn is an image-to-text model supporting both Chinese and English, released under the MIT License.
Image-to-Text Supports Multiple Languages
wzmmmm
27
0
Gemma 3 Glitter 4B
An optimized model based on Gemma 3 4B, using the same data mix as Glitter 12B.
Large Language Model
allura-org
140
3
Ola 7b
Apache-2.0
Ola-7B is a multimodal large language model jointly developed by Tencent, Tsinghua University, and Nanyang Technological University. Based on the Qwen2.5 architecture, it supports processing text, image, video, and audio inputs and generates text outputs.
Multimodal Fusion Supports Multiple Languages
THUdyh
1,020
37
Florence 2 FT DocVQA
MIT
A document visual question answering model fine-tuned from Florence-2-base for QA tasks on document images.
Image-to-Text Transformers English
sahilnishad
4,928
0
Longvu Llama3 2 1B
Apache-2.0
LongVU is a spatio-temporal adaptive compression mechanism for long-video language understanding, designed to process long video content efficiently.
Video-to-Text PyTorch
Vision-CAIR
465
11
Oryx 1.5 7B
Apache-2.0
Oryx-1.5-7B is a 7B-parameter model built on the Qwen2.5 language model. It supports a 32K-token context window and specializes in efficiently processing visual inputs of arbitrary spatial size and duration.
Text-to-Video Supports Multiple Languages
THUdyh
133
7
Longvu Llama3 2 3B
Apache-2.0
LongVU is a spatio-temporal adaptive compression mechanism for long-video language understanding, designed to process long video content efficiently.
Video-to-Text PyTorch
Vision-CAIR
1,079
7
H2ovl Mississippi 800m
Apache-2.0
An 800M-parameter vision-language model from H2O.ai, specializing in OCR and document understanding.
Image-to-Text Transformers English
h2oai
77.67k
33
Florence 2 DocVQA
A version of Microsoft's Florence-2 model fine-tuned for one day on 5% of the Docmatix dataset, suitable for image-text understanding tasks.
Text-to-Image Transformers
impactframes
30
1
Florence 2 Large Florence 2 Large Nsfw Pretrain Gt
A model built on the Transformers library; its specific functions and uses are not documented.
Large Language Model Transformers
ljnlonoljpiljm
55
6
Ucmt Sam On Depth
MIT
A mask generation model implemented in PyTorch and pushed to the Hub via PyTorchModelHubMixin.
Image Segmentation
weihao1115
35
1
Ecot Openvla 7b Oxe
A pretrained Transformer model for robotic control tasks, supporting basic functions such as motion planning and object grasping
Large Language Model Transformers
Embodied-CoT
2,003
0
Horus OCR
Donut is a Transformer-based image-to-text model capable of extracting and generating textual content from images.
Image-to-Text Transformers
TeeA
21
0
Icon Captioning Model
BSD-3-Clause
An image captioning model based on the BLIP architecture, designed to generate text descriptions for icons and simple images.
Image-to-Text Transformers
Revrse
98
5
Fine Tuned Rvl Cdip
A fine-tuned version of the microsoft/layoutlmv3-base model for document image classification tasks, achieving an F1 score of 0.8177 on the evaluation set
Text Recognition Transformers
davidhajdu
21
1
Interpret Cxr Impression Baseline
This model can convert medical images (such as X-rays) into descriptive text to assist in medical diagnosis.
Image-to-Text Transformers
IAMJB
17
0
Output LayoutLMv3 V7
A document understanding model fine-tuned from microsoft/layoutlmv3-base for document layout analysis tasks.
Text Recognition Transformers
Noureddinesa
18
1
Donut Base Handwriting Recognition
MIT
A handwriting recognition model fine-tuned from naver-clova-ix/donut-base.
Text Recognition Transformers
Cdywalst
140
1
Llava Maid 7B DPO GGUF
LLaVA is a large language and vision assistant model capable of handling multimodal tasks involving images and text.
Image-to-Text
megaaziib
99
4
Docllm Baichuan2 7b
DocLLM_reimplementation is a project that reimplements the DocLLM large language model for document understanding tasks, with the aim of improving document comprehension.
Large Language Model Transformers
JinghuiLuAstronaut
185
5
Chart To Table
Apache-2.0
This model converts charts into structured tables. It is built on the UniChart architecture, and the generated tables use specific delimiters to mark row and column boundaries.
Image-to-Text Transformers English
khhuang
345
17
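A delimiter-based table string like the one the Chart To Table entry describes can be turned back into rows and cells with a few lines of Python. This is a minimal sketch, not the model's own post-processing: the listing does not name the delimiters, so the `&&&` row separator and `|` column separator used as defaults here are assumptions, and `parse_delimited_table` is a hypothetical helper. Check the model card for the actual delimiters before relying on it.

```python
def parse_delimited_table(text, row_sep="&&&", col_sep="|"):
    """Split a linearized table string into a list of rows of cells.

    row_sep and col_sep are ASSUMED delimiters; the listing only says
    that "specific delimiters" are used, without naming them.
    """
    rows = []
    for raw_row in text.split(row_sep):
        raw_row = raw_row.strip()
        if not raw_row:
            continue  # skip empty fragments left by leading/trailing separators
        rows.append([cell.strip() for cell in raw_row.split(col_sep)])
    return rows


# Hypothetical model output, made up for illustration only.
example = "Year | Sales &&& 2020 | 10 &&& 2021 | 15"
table = parse_delimited_table(example)
header, body = table[0], table[1:]
```

Keeping the separators as parameters means the same helper can be pointed at a different linearization format by passing other values for `row_sep` and `col_sep`.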
Trained Model
This model is a fine-tuned version of microsoft/layoutlmv2-base-uncased on the generator dataset, suitable for document understanding and layout analysis tasks.
Large Language Model Transformers
vfu
14
0
Git Base Next Refined
MIT
Fine-tuned image-to-text model based on microsoft/git-base
Large Language Model Transformers Other
swaroopajit
24
0
Git Base Next
MIT
Fine-tuned image-to-text model based on microsoft/git-base
Image-to-Text Transformers Other
swaroopajit
19
1
Nougat
This model is outdated. It is recommended to use the official Nougat model. Nougat is an advanced vision-language model focused on document understanding and analysis.
Image-to-Text Transformers
nielsr
14
4
Git Base Fashion
MIT
An image-to-text model fine-tuned from microsoft/git-base, specialized for the fashion domain
Image-to-Text Transformers Other
swaroopajit
41
1
Donut Trained Example 2
MIT
A model fine-tuned from naver-clova-ix/donut-base; its specific purpose is not stated.
Large Language Model Transformers
anarenteriare
13
0
Thesisdonut
MIT
A model fine-tuned from naver-clova-ix/donut-base; its specific uses and functions are not documented.
Image-to-Text Transformers
Humayoun
13
0
Wavlm Bert Fusion S Emotion Russian Resd
A multimodal fusion model based on WavLM and BERT, suitable for joint speech and text task processing.
Speech Recognition Transformers
Aniemore
298
3
Deplot
Apache-2.0
DePlot is a visual-language reasoning model that converts chart images into linearized tables, enabling few-shot reasoning when combined with large language models.
Image-to-Text Transformers Supports Multiple Languages
google
13.72k
298
Donut Base Bol
MIT
A document understanding model fine-tuned from naver-clova-ix/donut-base, suitable for image folder datasets
Text Recognition Transformers
prakriti42
13
0
Donut Base Ru
MIT
A model fine-tuned from naver-clova-ix/donut-base; its specific purpose is not stated.
Large Language Model Transformers
Nyaaneet
21
1
Layoutlmv2 Base Uncased Finetuned Docvqa V2
This model is a fine-tuned version of microsoft/layoutlmv2-base-uncased for document visual question answering tasks, focusing on processing text and layout information in document images.
Image-to-Text Transformers
MariaK
54
3
Layoutlmv3 Finetuned Funsd
A document understanding model based on microsoft/layoutlmv3-base, fine-tuned on the nielsr/funsd-layoutlmv3 dataset.
Text Recognition Transformers
Narsil
799
0
Smallcap7m
A model capable of converting image content into textual descriptions, suitable for various vision-language tasks.
Image-to-Text Transformers English
Yova
977
5
Donut Demo
MIT
VisionEncoderDecoder model fine-tuned on the CORD-v2 dataset for document understanding tasks
Text Recognition Transformers
nielsr
56
1
Layoutlmv3 Finetuned Cord
A LayoutLMv3-based document understanding model fine-tuned on the CORD dataset for document token classification tasks.
Text Recognition Transformers
nielsr
617
12
Layoutlmv3 Finetuned Funsd
A document understanding model based on LayoutLMv3-base, fine-tuned on the FUNSD dataset for token classification tasks on forms and documents.
Text Recognition Transformers
nielsr
2,420
25